Structurally constrained protein evolution: results from a lattice simulation. 



O 

o 
o 

(N 



43 

o 



03 
> 

CO 

o3 

a 

C 

o 

(N 
> 

m 
m 
(N 

On 
On 

03 



13 

o 
o 



X 
S3 



Ugo Bastolla^ 1 - 1 , Michele Vendruscolo' 2 - 1 and H. Eduardo Roman' 3 ' 
FU Berlin, Inst, fur Kristallographie, Takustr. 6, D-14195 Berlin, Germany 
^Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel 
^INFN, Sezione di Milano, Universita di Milano, Via Celoria 16, 1-20133 Milano, Italy 

We simulate the evolution of a protein-like sequence subject to point mutations, imposing 
conservation of the ground state, thermodynamic stability and fast folding. Our model is aimed 
at describing neutral evolution of natural proteins. We use a cubic lattice model of the protein 
structure and test the neutrality conditions by extensive Monte Carlo simulations. We observe that 
sequence space is traversed by neutral networks, i.e. sets of sequences with the same fold connected 
by point mutations. Typical pairs of sequences on a neutral network are nearly as different as 
randomly chosen sequences. The fraction of neutral neighbors has strong sequence to sequence 
variations, which influence the rate of neutral evolution. In this paper we study the thermodynamic 
stability of different protein sequences. We relate the high variability of the fraction of neutral 
mutations to the complex energy landscape within a neutral network, arguing that valleys in this 
landscape are associated to high values of the neutral mutation rate. We find that when a point 
mutation produces a sequence with a new ground state, this is likely to have a low stability. Thus 
we tentatively conjecture that neutral networks of different structures are typically well separated 
in sequence space. This results indicates that changing significantly a protein structure through a 
biologically acceptable chain of point mutations is a rare, although possible, event. 



I. INTRODUCTION 

Almost unmistakingly, naturally occurring proteins 
with sequence similarity larger than 40% adopt similar 
folds 0. Since natural proteins with homologous se- 
quences descend from a common ancestor, this obser- 
vation indicates that protein structures are significantly 
conserved in evolution. Indeed, several proteins with dif- 
ferent functions show a remarkable structural similarity 
of evolutionary origin even if their sequences can not any- 
more be recognized as related ||||. A recent study on 
the Protein Data Bank (PDB) showed that the typical 
sequence similarity between proteins with the same fold 
is about 8.5% [Q, only slightly larger than for a random 
pair of sequences Q . In this set also proteins with com- 
mon ancestors are likely to exist. These observations cue 
to the fact that during evolution, there is a strong mem- 
ory for the structure but only a very loose memory for 
the sequence. 

The neutral theory of molecular evolution, proposed 
in 1968 by Kimura Q and, independently, by Jukes 
and King Q|, is well consistent with these observations. 
Kimura suggested that most amino acid substitutions in 
protein sequences are selectively neutral, i.e. indistin- 
guishable from the wild type from the phenotypic point 
of view, and are fixed by chance in biological populations 
H . This hypothesis has been heatedly debated in the ge- 
netic literature (§]||. Strictly speaking, conservation of 
the fold and neutral evolution are not equivalent, since 
neutrality deals with the activity of the protein, concen- 
trated in its active site, more than with its structure. 
Moreover drastic changes in the environment can mod- 



ify the selective value of protein structures. However, 
since our model does not represent biological activity, we 
assume in the following neutrality to be synonymous of 
structure conservation. 

In the last decade, the possible occurrence of neutral 
evolution has been revealed by a series of computational 
and analytic studies of the sequence to secondary struc- 
ture relationship for RNA molecules pQ] , An exponen- 
tially large number of sequences correspond on average 
to a single structure, and the distribution of the num- 
ber of sequences per structure is quite broad (following 
a power law), with the most common structures form- 
ing connected neutral networks which percolate sequence 
space. 

For proteins, the sequence to structure relationship is 
much more difficult to study than for RNA. Shakhnovich 
and Gutin jll) , using the Random Heteropolymer model, 
argued that the probability that a point mutation is neu- 
tral (i.e. it does not alter the native state) is non van- 
ishing even for very long sequences. In the same spirit, 
Tiana et al. fl^ l considered a cubic lattice model and a 
sequence with 36 residues, optimized in such a way that 
its ground state coincides with a target structure and is 
very stable |13|, and estimated that 70% of the point 
mutations are neutral. Bornberg-Bauer |l4| studied a 
two-dimensional HP model with only two residue types 
and chain length TV = 18 by using exact enumeration. 
He found, in analogy with the RNA case, that the distri- 
bution of the number of sequences per structure is very 
broad, but sequences corresponding to the same struc- 
ture are clustered in small regions of sequence space. 

We simulate the evolution of a protein sequence sub- 
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ject to structure conservation. Mutations that change 
the protein's native structure, identified with the ground 
state of the model, are considered lethal and are rejected. 
In this way, our sequence follows an evolutionary trajec- 
tory on a neutral network, i.e. a set of sequences sharing 
the same fold and connected by point mutations. While 
the structure (fold) is conserved, the sequence changes as 
new mutations are accepted, and after a sufficient num- 
ber of steps along the evolutionary trajectory have been 
performed, the sequence behaves essentially as a random 
one with respect to the original one jljj . 

It is important, however, to impose not only the con- 
dition that the native state is conserved but also that its 
stability remains high and the folding time remains low. 
These conditions are not only biologically relevant, but 
also help the model protein to diffuse in sequence space. 

We use in this study a lattice representation of protein 
conformations, because only in this way the ground state 
and its thermodynamic stability can be reliably deter- 
mined, but we believe that our simplified model repro- 
duces the generic features of the evolution of real protein 
sequences. 

Support to our results comes from a recent study by 
Babajide and coworkers |ll|, who found evidence for the 
presence of neutral networks in protein sequence space. 
Their work is similar in spirit to the present one, but 
rather different methodologically. Real protein structures 
were represented through the C a and Cp coordinates 
taken from the PDB, and an approximate criterion of 
fold recognition based on the Z score J!?]] was used. Fur- 
ther support also comes from the work of Govindarajan 
and Goldstein who introduced the "foldability" 

landscape in order to describe molecular evolution. In 
the language of Govindarajan and Goldstein the foldabil- 
ity of a protein represents its fitness for survival during 
evolution and it is related to the stability and to the ki- 
netic accessibility of the native state. Govindarajan and 
Goldstein also found that their evolutionary dynamics in 
sequence space was confined inside "neutral networks" . 

Tha main result of our paper, namely the fact that the 
fraction of neutral neighbors strongly fluctuates inside 
the neutral network, and that these fluctuations can be 
related to the foldability landscape, should also be put 
in relation with the recent preprint by Tiana et al. ]2C| ] 
where it is shown that the energy of a target structure has 
a complex landscape with valleys and barriers in sequence 
space. The present work supports such a picture by using 
the same protein model but employing rather different 
methods of investigation. 

We already presented some results on neutral networks 
in Ref. jl5). Here we focus our attention on the issue of 
the stability of the native state, relating it to the charac- 
teristics of the evolution. In Sec.[ll[ we describe our model 
protein and our protocol to simulate neutral evolution. 
In Sec. Ill we summarize our previous results. In Sec. IV 



tionary dynamics. Sec.^ presents an overall discussion, 
relating our results to biological observations. 



II. A SIMPLE MODEL OF PROTEIN 
EVOLUTION 

In this section we define the lattice model used to rep- 
resent protein structure and the algorithm introduced to 
simulate evolution in sequence space. 



A. Lattice model of protein structure 

To investigate the correspondence between sequences 
and structures we use a lattice model with twenty amino 
acid types. We consider sequences of length N = 36, de- 
noted by the symbol S = {si,...,sjv}, where Si belongs 
to a twenty- letter alphabet. Configurations are repre- 
sented by self avoiding walks on the simple cubic lattice, 
where each occupied site represents an amino acid. An 
energy E(S,C) is assigned to configuration C of sequence 
S according to the rule: 



E(S,C) 
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E 



C^ij U (ySi J Sjj , 



(1) 



where U(a,b) is a 20 x 20 symmetric interaction ma- 
trix expressing the contact interactions of amino acids 
of species a and b. We use an interaction matrix U (a, b) 
derived from the Miyazawa-Jernigan interaction matrix 
The matrix C = {C y } = /(C), called the contact 
map of configuration C, has elements Cjj equal to one if 
residues i and j are nearest neighbors on the lattice but 
not along the chain and zero otherwise. The similarity 
between contact maps is measured through the overlap 
q(C, C), defined as 



c i<j 



(2) 



we describe the properties of the sequences generated, 
dividing them in four classes. This section focuses on 
the relation between thermodynamic stability and evolu- 



where N* is the maximal between N c and iV^, the num- 
ber of contacts respectively of two contact maps C and 
C, and N c — J2j>i Qj- With this definition, two maps 
are identical if and only if q = 1. Note that this does not 
imply in general identity of configurations. Nevertheless, 
we use the overlap as a measure of similarity in config- 
uration space because structures with the same contact 
map are degenerate in energy and for compact structures, 
only small conformational fluctuations are allowed when 
the entire set of contacts is specified. Moreover, such 
structural fluctuations might play an important role in 
protein functionality [ p2[ . 

The native structure of sequence S is identified with 
the ground state of the model if it is thermodynamically 
stable. We evaluate stability by measuring the average 
overlap (q) between the ground state and the Boltzmann 
ensemble of structures: 
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( q ) = ±j2«( c °>f( c ^ E(c ' s)/T > (3) 

c 

where Co is the contact map of the ground state, f(C) is 
the contact map of configuration C and Z is the partition 
function. This quantity is close to one if all the low energy 
structures are quite similar to the native state. In this 
case the energy landscape of the model is well correlated, 
and the sequence is also expected to be a good folder. 

FIG. 1. The "native state" of the model protein. The ini- 
tial residue is the one in the bottom left corner. 

We consider the target contact map C* represented in 
Fig.|l|. It has N c = 40 contacts, the maximal number 
of contacts possible for a chain of length N = 36. In 
this case the contact map defines uniquely the configu- 
ration of the system. The contact map C* was stud- 
ied by Shakhnovich and coworkers in a computer exper- 
iment of inverse folding p3| . They designed a sequence 
with ground state on C* using the procedure of Ref. p3[ , 
and showed that S* has good properties of kinetic fold- 
ability and thermodynamic stability at the temperature 
where the folding is fastest. The lower part of the en- 
ergy landscape of this sequence is remarkably smooth: 
all the structures with low energy have a high overlap 
q with the ground state. The lowest energy of configu- 
rations with a fixed value of q decreases regularly as q 
approaches one. This correlated energy landscape, remi- 
niscent of the "funnel" paradigm , is the reason of the 
good folding properties of the sequence, which is very dif- 
ferent from a random sequence. In jl2j it was shown that 
the same sequence is also very stable against mutations. 
It was estimated that about 70% of the point mutations 
performed on S* result in new sequences with exactly the 
same ground state and good folding properties. Thus en- 
ergy minimization makes C* stable not only in structure 
space, but also in sequence space. 

We note that C* is an atypical structure for the in- 
teraction parameters that we choose: since U (a, b) has 
average value zero and variance 0.3, one would expect 
open structures to be energetically favored. Indeed, typ- 
ical random sequences with N — 36 and Gaussian contact 
interactions whose average vanishes have a ground state 
with approximately 29-33 contacts p5| , being thus less 
than maximally compact. 

B. Sequence space 

In this study we consider only point mutations, thus all 
sequences have the same length N = 36 and the metric 
in sequence space is given by the Hamming distance, 

N 

D(S,S') = £[l-<5( Si X)] , (4) 

i=l 



where S is the Kronecker symbol and Si takes 20 different 
values, one for each amino acid. A measure of sequence 
similarity is then given by the overlap Q(S, S'), 

1 N 

Q(S,S') = -J2Hs i ,s' i ), (5) 

i=l 

which is equal to one minus the normalized Hamming 
distance. 

We introduce also the distance D#p(S,S') and the 
overlap Q#p(S,S') to measure differences in hydropho- 
bicity. These are defined by transforming every sequence 
into a sequence of binary symbols, either H or P, accord- 
ing to the hydrophobicity of the residue. We consider 8 
hydrophobic amino-acids and 12 polar ones. The defini- 
tions of Dhp and Qhp are analogous to those of D and 
Q, where now Si can take only two values. 

C. Evolutionary process 

Our protein sequence evolves through point mutations 
subject to conservation of the target contact map C*, 
representing the biologically active native structure [ p6| . 
We impose this condition by simulating the following it- 
erative procedure: 

1. At t = we start from S(0) = S*, which has C* as 
its "native state" . 

2. At time t we mutate at random one amino acid in 
S(t — 1), producing a new sequence S'(t). 

3. We submit the new sequence to selection according 
to the criteria specified below. If the sequence is 
accepted then S(t) = S'(i), otherwise we restore 
S(t)=S(i-l). 

The selection step is governed by the following three con- 
ditions 

Conservation of the ground state . The ground 
state C of S'(i) must have an overlap with C* equal 
or larger than a given "phenotypic threshold" Qthr- 

q(C*,C)>q thl , (6) 

In our calculations we imposed strict conservation 
of the native state by setting q t hr = 1- 

Thermodynamic stability . We define thermody- 
namic stability through the condition 

(g(C*,C)> ><g)thr, (7) 

where (•) represent a Boltzmann average at the 
temperature T of the simulation and (<?)thr is a fixed 
parameter. This condition implies that all the ther- 
modynamically relevant states are very similar to 
the target state. 
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Kinetic accessibility . The structure C* must be 
reached in a limited number of steps of our Monte 
Carlo algorithm, in at least two independent at- 
tempts. 

For the test of the sequences we used the PERM method 
p7|p8| , a Monte Carlo algorithm particularly suited for 
finding the ground state of lattice polymers. Note that 
there is no bias towards C* in our Monte Carlo algorithm, 
i.e. it has the same a priori probability of being visited 
as any other possible structure. We remark that other 
schemes of simulations are also suitable to the same ef- 
fect, as e.g. the Monte Carlo algorithm used in Ref. [p9| . 
Such Monte Carlo method with moves in configuration 
space is more suitable than PERM to estimate folding 
times. However, due to computational limitations, wc 
did not try to measure accurately the folding time, thus 
we adopted the PERM method, which is faster for the 
task that is interesting for us. 

The test of a new sequence S is divided into three 
phases: 

• We discard S if after m iterations C* is not reached 
or if other structures of energy lower than C* are 
found. 

• Otherwise we continue to run the algorithm for an- 
other m iterations and discard S if we find struc- 
tures of energy lower than C*. 

• If S passed the first two phases, we run again and 
independently the MC algorithm for a time 2m and 
accept S if also this time C* is found as the lowest 
energy structure. 

Thus for each accepted sequence we run the algorithm 
for 4m steps, with m = 124000. We never found in the 
second independent run of the MC algorithm a struc- 
ture with lower energy than the putative ground-state C* 
found in the first run. This fact encourages us to believe 
that the algorithm was effective in finding the ground 
state. Another support to this conclusion comes from 
the fact that, as it will be discussed later in more detail, 
all of the selected sequences have a remarkably corre- 
lated energy landscape, which makes the task of finding 
the ground state easier. 

On the other hand, whenever the sequence was re- 
jected, we are less sure that we were able to determine 
its ground state. The difference is due to two reasons: 
first, we investigate rejected sequences on the average for 
a shorter time. Second, rejected sequences have typi- 
cally a less correlated energy landscape, so that the de- 
termination of the ground state should be more difficult. 
Nevertheless, we shall present in Sec. IV also data about 



rejected sequences, since they are interesting and refer to 
a very large number of sequences, even if they are indi- 
vidually not completely reliable. 

The three conditions for the acceptance of a mutation 
enforce the conservation of the fold of the protein. This 



is similar to neutral evolution where the biological activ- 
ity of the mutated sequence does not vary. Nevertheless, 
conservation of the fold is not a necessary condition for 
selective neutrality in real proteins, although a very high 
degree of conservation is usually observed, and it is not 
even a sufficient one, since - in the case of enzymes - 
the active site has also to be conserved and the environ- 
ment has to remain reasonably stable. Thus our model 
represents the neutral evolution of the part of the chain 
not involved in chemical activity, in a stable chemical en- 
vironment. Despite its simplifications, we believe that 
our model captures important features of structural con- 
straints in the neutral evolution of proteins. 



III. NEUTRAL NETWORKS 

In this section we summarize results regarding the dif- 
fusion in sequence space under our model of neutral evo- 
lution. More details have been given in Ref. p5[. 



A. Hamming distance 

An interesting result of our simulation is that se- 
quences originated from the same common ancestor di- 
verge so much that their similarity is almost as low as for 
a random pair of sequences while their structures remain 
unchanged. Starting from the same sequence we gener- 
ated eight realizations of neutral evolution, simulating 
the phylogenetic radiation of eight species from a com- 
mon ancestor. We use the following values of the selection 
parameters: phenotypic threshold qthr = 1, correspond- 
ing to exact conservation of the ground state, stability 
threshold (g)thr = 0.90 at a temperature T = 0.16 cho- 
sen so that the folding of the initial sequence S* is fastest 
with our Monte Carlo algorithm. 

The average Hamming distance between the final 
points of the eight evolutionary trajectories is D = 30.2, 
only slightly smaller than the random value D ran = 34.2 
(see FigJ|). However, this quantity had not yet reached a 
stationary value when the simulations were interrupted, 
thus we can not exclude that the long time behavior co- 
incides with D ran . An indication in this sense is the fact 
that the maximum distance between sequences in two 
different trajectories is D = 35. All the residues in the 
original sequence could be substituted at least twice, but 
some are more difficult to change. We define the degree 
of conservation, or rigidity, of residues at the i-th position 
in the sequence as follows: 



Ri — 



E 



(8) 



where Pi{a) is the probability to find the amino-acid a 
at position i. Pi(a) is estimated from the end points of 
the eight neutral paths generated. R4 = 1 if the amino- 
acid at position i is never changed, while Ri « 1/8 if 
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it is completely random. We found that several posi- 
tions have rigidity compatible with the random value, 
and no position has R4 = 1. As one would expect, the 
most conserved positions are the two in the interior of 
the structure (see Fig.0). 
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FIG. 2. Histograms of the Hamming distances between the 
end points of eight independent trajectories using the full 20 
amino-acid alphabet (white) and the reduced HP alphabet 
(black) . The vertical dashed lines represent the average values 
for random pairs of sequences. 

The same results hold for the HP representation in 
which hydrophobic (H) and polar (P) amino acids are 
grouped together so that Si can assume only two val- 
ues. The average value of the distance is in this case 
Dhp = 16.3, not far from the random value D r ^p = 17.3, 
and the variance is Vhp = 6.0, compatible with Vhp = 
Dhp{1 — Dhp)/N. It is at first sight surprising that also 
the distance Dhp is close to that expected for random 
sequences. This is in part an effect of the short length of 
our sequences, since only two residues are in the interior 
of the structure, while all other ones are at the surface. It 
is interesting to note, however, that also the two residues 
in the interior of C* have rigidity Ri < 1, even when 
the two letter HP representation is used. The distinction 
between polar and hydrophobic residues is based only on 
the interaction matrix that we use, in which also polar 
residues can have attractive interactions. It is interesting 
to note that a recent study of real protein structures BG] 
found a correlation between amino acids buried in the 
core and evolutionary conserved ones, consistently with 
our results. 



B. Neutral Mutation Rate 

For a given sequence S of N amino acids, we define the 
neutral mutation rate x(S) as the fraction of acceptable 
non-synonymous mutations 



x(S) 



1 



N 1,20 

2(W ^ ^2 Xm 

l — l OL^Si 



(S) 



(9) 



where X a j(S) equals one if assigning the amino acid of 
species a at position i on the sequence S does not change 
the native state, and zero otherwise. Non-synonymous 
mutations are those for which an amino acid is not re- 
placed by itself. 

The simplest measure of the neutral mutation rate is 
obtained by computing the frequency of neutral muta- 
tions over all the non-synonymous mutations proposed. 
In this way we found x w 0.05 (the overline represents 
an average over the mutational process). However, this 
quantity alone is not enough to characterize x(S), which 
fluctuates strongly in sequence space. For instance, it 
was estimated by one of us and coworkers that 
x(S*) w 0.7, where S* is the starting point of our evolu- 
tionary trajectories. 

We measured indirectly the distribution of x(S) in se- 
quence space from the distribution of the "trapping" time 
Tt(S) that a trajectory spends on sequence S. The aver- 
age value of the trapping time is inversely proportional 
to the neutral mutation rate: 

^) = ^)> ^ 

where the bar denotes average over the different muta- 
tions. The distribution of t at fixed a; is a geometric one, 
Px{ T ) — x(l — x) T ~ 1 , so that, averaging over the neutral 
set, we get 



[P(r)] = f dxp 
Jo 



(x) 



1 



{i-xY , 



(11) 



where [•] denotes an average over sequences belonging to 
the neutral network (in this argument we neglect the er- 
ror in evaluating whether a sequence belongs to the neu- 
tral set: in particular, the conditions of fast folding and 
of thermodynamic stability are subject to considerable 
evaluation errors). 

The distribution of r t is broader than an exponential 
one (Fig||), thus, even if we can not invert Eq. [H], we 
expect that the distribution of the neutral mutation rate 
x is also broader than exponential. 



10 




10 



10 



10 3 



5 



FIG. 3. Distribution of the trapping time r*. 

The values of r t for neighboring sequences are rather 
correlated, but the correlation seems to vanish after few 
steps in sequence space (data not shown). 

C. Genetic drift and population genetics 

In Kimura's neutral theory H it is assumed that the 
fraction of neutral mutations x(S) does not depend on the 
sequence. With this hypothesis, the time evolution of the 
Hamming distance D{t) between the starting sequence 
S(0) and the present sequence S(t) is given in our model 
by 

D(t)/N » (l - [1 - exp(-xt/N)} , (12) 

where the time t represents the number of mutational 
events. However, this hypothesis is contradicted by our 
results, which show that the relaxation of the distance is 
not exponential. This fact is due to the large fluctuations 
of the neutral mutation rate along the neutral network. 

This qualitative result can be interesting for the un- 
derstanding of protein sequence evolution. A key issue 
in the theory of molecular evolution is the following: Is 
the rate at which amino acids are substituted in protein 
sequence constant on different branches of the phyloge- 
netic tree? The constancy of the substitution rates was 
proposed by Zuckerkandl and Pauling in their pioneer- 
ing study of molecular evolution as the molecular clock 
hypothesis |3lJ . This hypothesis has been questioned re- 
cently |^,[| , even if it seems to be at least approximately 
valid for several proteins. 

Kimura's theory predicts that neutral substitutions oc- 
cur at a rate r = /j,x which is the product of the bare 
mutation rate \x times the fraction x of neutral muta- 
tions. This rate is independent of the size of the popula- 
tion. The value x characterizes the substitution rate of 
a particular protein but does not vary during evolution. 
Mutations occuring in a geological time T are thought 
to follow a Poissonian statistics with average value fiT, 
so that the number of substitutions is predicted to have 
Poissonian statistic with average value (ns) = fiTx. This 
has the important consequence that the fluctuations of 
the substitution process in different species should in- 
crease only as VT. More precisely, the ratio R(T) be- 
tween the variance and the average value of the number 
of substitutions should be identically equal to one. This 
strong prediction was first tested by Kimura [|| with the 
conclusion that deviations from the Poissonian statistics 
are small. However, more recently Gillespie repeated the 
test for a larger number of proteins, finding that for most 
of them the value of R(T) is much larger than one. He 
thus argued that the hypothesis that most mutations are 
neutral has to be rejected. 



Our results provide an alternative explanation: the 
strong fluctuations of the substitution process can be at- 
tributed to the fluctuations of the neutral mutation rate 
in sequence space, even in the absence of any selective 
pressure. To test this hypothesis, we assume, as above, 
that the number mfe(T) of attempted mutation events 
in a time T during trajectory k is a Poissonian variable 
of average value fj,T. In the present study, k = 1, ...8 is 
the label of the evolutionary trajectories. Then for every 
trajectory k we count the number rifc(T) of mutations ac- 
cepted over TOfc(T) steps of our evolutionary algorithm. 
This number is then interpreted as the number of substi- 
tutions in "species" k. We can thus compute the variance 
and the average value of this variable over the eight tra- 
jectories. The ratio between them gives an estimate of 
the dispersion ratio R(T). This is always larger than 
one, contradicting the Poissonian hypothesis. Moreover, 
R(T) is found to be an increasing function of T, so that 
it is no longer true that the fluctuations grow with time 
as s/T. 

Since our model takes into account only neutral and 
lethal mutations, without considering either advanta- 
geous mutations or slightly deleterious ones, we conclude 
that the violation of Poissonian statistics is not a decisive 
proof against the validity of the neutral hypothesis. 

IV. PROPERTIES OF THE SEQUENCES 

We classify the nearly 12,000 sequences generated by 
our evolutionary algorithm in four classes, with the re- 
minder that the identification of the ground state is only 
tentative for rejected sequences, as already discussed. 

1. Selected sequences, belonging to the neutral net- 
work. Their number is a fraction / = 0.050 of the 
total set. 

2. Unstable sequences, / = 0.172. Their lowest 
energy state coincides with C*, but the stability 
condition is not fulfilled. The rejection was made 
in most cases already after the first MC run, if the 
condition (q(C,C*)) > 0.75 was not fulfilled, oth- 
erwise the sequence was studied in another MC run. 

3. Slow folding sequences, / = 0.472. For such 
sequences, no structure with energy lower than 
E(C*,S) was found, but the MC algorithm did 
not reach the target structure C*. In some cases 
(/ = 0.061) C* was reached in the first MC run 
but not in the second one, while in most cases the 
rejection was made already after the first run. 

4. Structurally mutated sequences, / = 0.306. 

For such sequences the lowest energy structure Co, 
has lower energy than the target structure: 

£(C 0) S) <E(C*,S). (13) 
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Before describing separately the properties of these 
classes of sequences, we show that the value of the Z score 
is able to distinguish statistically the different classes. 
The Z score O] is used to evaluate the match between 
a sequence S and a structure C* taken from a pool of 
alternative structures. It is defined as 



Z(C*,S) 



£(C»,S)-(£(C,S)) 
(^(C,S))-(£(C,S)) 5 



(14) 



where the brackets denote average with respect to the 
ensemble of alternative structures at high temperature. 
The more negative Z is, the better is the match between 
the sequence S and the structure C* and the more stable 
is the structure C*, provided that it is really the lowest 
energy structure. This measure is often used in computer 
experiments of fold recognition |l6| , |l7|. 

Following Mirny and Shakhnovich p2], we use a sim- 
plified measure of the Z score, considering as alternative 
structures only maximally compact structures and ap- 
proximating their average energy and their variance with, 
respectively, the average energy and the variance of the 
set of all possible contacts (we take into account the fact 
that in the simple cubic lattice the only possible contacts 
are those between monomers of different parity). More 
precisely, our definition is 



Z'(C*,S) = 



E{C*,S) -Nc max [U(Si,Sj)] 



(15) 



NCma^iU^S^Sj)] - [U(Si, Sj)Y 



where 



U(Si,Si) 



(16) 



Pij is one if a contact between amino acids i and j is pos- 
sible in some configurations, zero otherwise and Nc max 
is the number of contacts for maximally compact struc- 
tures, Nc m ax — 40 for N = 36 on the cubic lattice. Z' is 
a good approximation to the Z score and it is very easy to 
compute numerically without the need for a simulation 
at a high temperature. 

We plot in figure || the distribution of the Z score for 
the four classes of sequences. For the structurally mu- 
tated sequences we evaluated both the Z score of the 
target structure C* and the Z score of the lowest energy 
structure found, Co- Note that the starting sequence 
S(0) has the lowest value of the Z score, i.e. Z = —1.22. 
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FIG. 4. Distribution of the Z score for different classes of 
sequences. 

The ranking of the Z score for different classes is as 
expected on the basis of their stability. The most nega- 
tive values of Z are proper of selected sequences, which 
are most stable. Next come slow folding and unsta- 
ble sequences. The Z score of mutated ground states, 
Z(Co,S), are less negative. Last in this rank of sta- 
bility comes the Z(C*,S) for the structurally mutated 
sequences. In this case, we are sure that C* is not the 
ground state of the sequence. 
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FIG. 5. Fraction of unstable, slow folding and mutated se- 
quences that would be selected with a criterion based on a 
threshold value of the Z score, Z c . Solid line: fraction of 
sequences selected by our algorithm that would be rejected 
with the same criterion. 

These results confirm that the evaluation of the Z score 
is an efficient criterion to decide whether C* is the ground 
state of a sequence S. Nevertheless, the fact that the dis- 
tributions of the Z score relative to different classes have 
large overlaps should make one worry that the Z score 
is not a precise criterion. In Fig.|| we report the results 
that we would get using a threshold value of Z, Z c , as a 
criterion for fold recognition, instead of studying the se- 
quences with Monte Carlo simulations. The dotted line 
represents the fraction of sequences that are unstable or 
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slow folders according to our criterion and would be ac- 
cepted with a criterion based on the Z score. This goes 
from less than 60% for the most stringent threshold to a 
plateau value of about 80%. We can say very little about 
this class of sequences. It includes sequences that are re- 
ally of lower quality than selected sequences, sequences 
that are of the same quality but were not selected be- 
cause of the uncertainties of the selection procedure and 
sequences which do not fold to the target state. The 
dashed line represents sequences whose ground state is 
surely different from the target one. Their fraction in- 
creases from zero to about 20% as the threshold becomes 
less stringent. Finally, the solid line represents sequences 
that would be selected with our criterion but not with 
the Z score criterion, as a fraction of the total number 
of selected sequences. Our results indicate that a good 
choice for the threshold could be Z c ss —1.07: 14% of 
the sequences selected with this criterion would also be 
selected with our criterion, 80.5% would be sequences 
that do not fulfill the stability or fast folding conditions 
and 5.5% would be sequences which have certainly a dif- 
ferent ground state, but probably similar to the target 
one, since the Z score of mutated sequences is correlated 
to the similarity between the new ground state and the 
target state (see below). About 29% of the sequences 
that our criterion selects would be discarded with the Z 
score criterion. This number becomes much larger if the 
threshold is made more stringent. Thus, the criterion 
based on Z accepts most sequences that we reject and 
rejects a large fraction of those that we select. 

The distribution for the slow folding class is quite sim- 
ilar to that of the unstable class. This is not surprising, 
since it is well known that stability and fast folding are 
correlated in lattice heteropolymer models |53 34 1. In 
particular, stability as we defined it requires a correlated 
energy landscape, which is considered a property of fast 
folding sequences. Thus these results encourage us in be- 
lieving that the conditions we imposed and the algorithm 
to verify them were appropriate. 

Interestingly, the distribution relative to Z(Cq, S) lies 
to the right of the other ones, indicating that the stability 
of the structurally mutated ground states is rather low. 
This is not unexpected. In fact, structurally mutated se- 
quences are only one point mutation apart from selected 
sequences, thus C* should still have a low energy and 
should decreases the stability of Co- We shall comment 
further on this point in the conclusions. 

The Z score correlates well to the native overlap 
(q(C,C*)) that we assumed as a measure of thermody- 
namic stability (FigJ|) for unstable sequences (correla- 
tion coefficient r = —0.48) and for mutated sequences 
(this is due to the fact that the overlap between the mu- 
tated ground state and the native state correlates with 
to the Z score). No correlation is visible for selected 
sequences, which always have (q(C,C*)) > 0.9. For un- 
folded sequences the measure of (q(C, C*)) is not possible 
(see Fig.g). 
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FIG. 6. Average value of the native overlap as a function 
of the Z score Z(C). 



A. Selected sequences 

With our criterion we accepted 566 sequences in six 
evolutive trajectories. Such sequences are both fast fold- 
ing and thermodynamically stable. These properties are 
not typical of random sequences Q . 

For selected sequences there is a significant correlation 
between the energy E of a conformation and its overlap 
q with the native state. In Fig. we represent the 500 
lowest energy configurations of three different sequences 
as points in the (E, q) plane. For every sequence, all 
points fulfill the inequality 

1 - E(C, S)/E(C*, S) > a(S) (1 - q(C, C*)) . (17) 

The adimensional parameter a(S) is related to the energy 
gap of the ground state C* of sequence S, as defined by 
Shakhnovich and coworkers |Q. However, it character- 
izes more precisely the smoothness of the energy land- 
scape. For the initial sequence we find a(So) = 0.23. We 
did not measure a(S) for all selected sequences, but it 
appears from few examples that its value does not de- 
crease during the evolution of the protein. It is not sur- 
prising that selected sequences exhibit a correlated en- 
ergy landscape, since the condition of thermodynamic 
stability that we impose rules out sequences with mis- 
folded structures of low energy. Moreover, our selected 
sequences are fast folders, and one should expect that 
such sequences have a correlated energy landscape, since 
the relation between thermodynamic stability, fast fold- 
ing and smoothness of the energy landscape has long been 
discussed p4] , p3[ . Furthermore, it has been found that 
models which cannot give rise to good folding sequences 
present weak correlations between q and E p5|j2^ |. 

Consistently, it appears from our data that the fold- 
ing time is correlated to the stability (it has a negative 
correlation with the Z score and positive with the na- 
tive overlap), but we are not able to quantify this effect, 
since our measure of the folding time, based only on two 
simulations, is too imprecise. 
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FIG. 7. Correlations between the energy and the similarity 
with the ground state for the 200 lowest energy structures of 
three selected sequences. 

It is also interesting that for several sequences the en- 
ergy E(C* , S) is lower than for the starting sequence S*, 
although this has been obtained by minimizing the en- 
ergy E(C*,S) in sequence space. The reason for this is 
that the minimization method requires that the composi- 
tion of the sequence is kept fixed, while we do not impose 
this condition. However, the Z score reaches its lowest 
value Z = —1.22 for the starting sequence S*. 

We observe that the Z score of the target state, 
Z(C*, S), defines a complex landscape in sequence space, 
with valleys separated by barriers. This result is illus- 
trated in Fig.||, where we show the Z score of sequences 
in the neutral network as a function of the number of 
steps I along the network, starting from S*. 




20 40 60 80 

Steps in sequence space 

FIG. 8. Z score of sequences in the neutral network as a 
function of the number of steps along the network, starting 
from S*, for two evolutionary trajectories. 

The roughness of the energy landscape is consistent 
with the results reported in a recent preprint by Tiana et 
al. ptifl , who sampled the energy landscape in sequence 
space for a fixed structure using Monte Carlo simula- 
tions. The authors of |2(J observed a hierarchy of clus- 
ters and superclusters of low energy sequences, with su- 
perclusters characterized by few fixed amino acids in key 



positions and not connected by neutral paths. This con- 
clusion might at first seem at odd with the fact that 
we found connected neutral paths extended in sequence 
space, but there should be no contradiction between the 
present work and the results of ref . |^(| , since they deal 
with different questions. We studied a single neutral net- 
work asking whether it is possible to find in it pairs of 
sequences at any distance, while Tiana et al. 2C}| ask 
whether it is possible to find non homologous proteins 
which are in disconnected neutral networks and still fold 
to the same structure. It is possible that both answers are 
positive and that several extended but disconnected neu- 
tral networks exist for a given protein structure. Such a 
picture, if correct, implies that non homologous proteins 
sharing the same fold may have been originated either 
through convergent evolution, possibly on disconnected 
neutral networks, or through divergent evolution from a 
common anchestor on a single neutral network, but it is 
very difficult or even impossible to decide between these 
two possibilities on the basis of the sequence alone. 

On the other hand, we should note that the definition 
of neutral path in our work is different than the one used 
in pfjfl . In fact, while in |2(J is assumed that a path in 
sequence space is neutral if all sequences belonging to it 
have Z score lower than a predetermined threshold, we 
accept only sequences for which we can show through 
Monte Carlo simulations that the target structure is the 
ground state, it is thermodynamically stable and easy 
to reach kinetically. The criterion adopted in |2(]] has 
the advantage of being computationally very efficient and 
it correlates well with our criterion. In several cases, 
however, the two criteria give different answers, as it is 
shown in Fig. |^, and it is possible that networks which 
are disconnected according to the Z score criterion are 
found to be connected according to our criterion. Further 
study is necessary in order to assess the relevance of such 
possibility. 

It is rather interesting that the complex energy land- 
scape in sequence space offers an explanation for the large 
variations of the neutral mutation rate, x(S). Since a 
more stable sequence can tolerate larger rearrangements 
without changing its ground state, one can expect that 
the fraction of neutral mutations from sequence S in- 
creases with the stability of the sequence. This expecta- 
tion is in agreement with the results of ref. ppf , where 
the stability is measured by the Z score Z(C*,S) and 
neutrality of a mutation is recognized with a criterion 
based on the Z score of the mutated sequence. 

We study the correlation between the stability, mea- 
sured either by the native overlap or by the Z score, and 
the fraction of neutral neighbors x(S). Since we did not 
measure x(S), we have to rely on the trapping time Tt 
spent by a trajectory on sequence S. This variable is re- 
lated to x(S) through the geometric distribution P x (t) = 
x(l — xY^ 1 of average value l/z(S) (0). We thus esti- 
mate the correlation coefficient between the Z score and 
l/x(S), using the relations [Z(S)/x(S)} « [Z(S)r t (S)], 
[IMS)] « [r t (S)], [IMS) 2 ] « l/2[r t 2 (S) + r t (S)], 
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where the square brackets indicate average on the neu- 
tral network. This treatment neglects the fact that our 
criterion is subject to some arbitrariness, since we can 
not measure with high precision the native overlap (q) 
and the typical folding time upon which our criterion 
is founded. Thus, the correlation coefficient estimated 
in this way is underestimated. We find a correlation 
coefficient r = 0.20 between the Z score and 1/x and 
r = —0.21 between the native overlap and 1/x. Although 
the estimate is not accurate, this study confirms the ex- 
istence of correlations between the stability of the native 
state and the neutral mutation rate. We show the corre- 
lation between Z and r t in Fig.^J. 
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B. Unstable and slow folding sequences 

To these two classes belong all sequences that were re- 
jected although no structure of energy lower than C* was 
found. We do not know for which fraction of these se- 
quences the ground state coincides with C* and for which 
the ground state is changed. For unstable sequences C* 
is the lowest energy structure found and it is reached in 
at least one of the two independent MC runs, but the sta- 
bility condition is not fulfilled. The rejection was made 
already in the first MC run if we found (q) > 0.75. Thus 
we can not exclude that in the second run also the fast 
folding condition would fail. For slow folding sequences 
C* was not reached either in the first or in the second MC 
run. The folding time for unstable sequences appears to 
be correlated to the native overlap (q), even if our data 
do not allow quantitative estimations. 

C. Structurally mutated sequences 

For sequences in this class we found putative ground 
states with energy lower than that of the target structure. 
A fraction / = 0.306 of the examined sequences belongs 
to this class. 

We first analyze the number N® of contacts in the 
mutated ground state. The distribution P(N®) is rem- 
iniscent of a bimodal distribution (see Fig. [H]). As 
previously mentioned, the number of target contacts, 
N* =40, is the largest possible for a sequence length 
of N — 36 residues. The twin peaks at A° = 37 and 
at N® = 40 derive from target-like ground states, while 
the constraints of the lattice geometry are probably re- 
sponsible for the depression of the values of P(38) and 
P(39). The broad peak at A°=34 is close to the num- 
ber of contacts expected for random sequences, although 
slightly higher. As mentioned above, in the case of a ran- 
dom contact potential with the same mean and variance 
as the one that we used and a Gaussian distribution, the 
number of contacts in the ground states ranges typically 
from 29 to 33 ||. 



FIG. 9. Average value of r t for sequences of the neutral 
network as a function of the Z score (upper panel) and of the 
average native overlap (lower panel). 

The previous observation can be the basis for a quanti- 
tative explanation of the large fluctuations of the neutral 
mutation rate in this model and possibly in real proteins. 
It would be very interesting to investigate to which ex- 
tent such fluctuations are responsible for the observed 
patterns of molecular evolution, which appear to be much 
more irregular than predicted on the basis of the simple 
"homogeneous" neutral theory. 
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FIG. 10. Distribution of the number of contacts in the 
ground state, N c , for all of the generated sequences. The 
peaks at high N c axe related to target-like structures. Insert: 
correlation between the number of contacts and the overlap 
qo between target state and ground state. 

The number of contacts is higher than for random se- 
quences because the "native" contacts of C* are advan- 
tageous even in the structurally mutated sequences. This 
is confirmed by the fact that there is a strong correlation 
correlation between the number of contacts N® and the 
overlap qo between the new ground state and the target 
state C* The correlation coefficient is r = 0.80. See in- 
sert in Fig.[l0| The two quantities are also correlated to 
the energy Eq of the ground state: The more native-like 
the mutated ground state is, the more compact it is and 
the lower is its energy. The correlation coefficients are: 
r = -0.31 between q and E , r = -0.32 (iV c ° and E ), 
r = -0.54 (<jo and Z score) and r = -0.47 (N° and 
Z score). The strongest correlation observed is that be- 
tween go and the Z score. The energy and the Z score 
are weekly correlated (r = 0.36). 

The distribution of q is also bimodal (see Fig. 



tion that biological evolution conserves the native fold of 
proteins even when their function changes substantially 
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The peak at high q is due to native-like structures and 
the broader peak at q = 0.3 is close to (but still signifi- 
cantly higher than) the typical overlap between random 
structures, q rcm — 0.1 p| for chains of length N = 36. 
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FIG. 11. Distribution of the overlap q(C* , C) between the 
target structure C* and the ground-state of the sequences 
studied, C. 

The overlap qo between the ground state and the tar- 
get state is negatively correlated to the Z score. This 
is shown in Fig.[l2[ where all the structurally mutated 
sequences are represented in a scatter plot in the plane 
(qo,Z). Only structures which are very similar to the 
original native state have a Z score in the range found 
for selected sequences. This result suggests that mutated 
ground states dissimilar from the original one are in most 
cases not stable enough to represent a new acceptable fold 
of the model protein. In other words, neutral networks of 
unrelated structures may lay far apart in sequence space. 
This feature of the model is consistent with the observa- 
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FIG. 12. Scattering plot showing the Z score of the ground 
state as a function of its overlap with the target state. 



V. DISCUSSION 

We simulated evolution on neutral networks for a pro- 
tein model with twenty amino acid types, contact en- 
ergy functions and structures represented as self avoiding 
walks on the simple cubic lattice. Stability of the ground 
state is measured as the thermodynamic average of its 
overlap with alternative structures. 

Our simulations show that neutral networks are ex- 
tended in sequence space: pairs of sequences on the neu- 
tral network are almost as different as random sequences, 
even if they have exactly the same fold. This observation 
is consistent with what is known about protein evolu- 
tion. In our 36mer chains, residues in all positions could 
be substituted, even if the two positions in the core are 
much more difficult to change, consistently with the re- 
sults of Ref. |(]]. 

The neutral network that we studied turned out to be 
rather irregular: the fraction x(S) of neutral mutations of 
a sequence S has large variations in sequence space. The 
value of x(S) is positively correlated with the stability 
of the native state, evaluated either through the average 
native overlap (g(C,C*)) or through the Z score. This 
appears very reasonable: The more stable is a sequence 
to structure match, the less probable is that a mutation 
destabilizes it. This is also in agreement with the results 
of recent complementary studies |3(^(|. It thus seems 
that thermodynamic stability implies also stability with 
respect to mutations. 

The fluctuations of x(S) in sequence space are impor- 
tant for the evolutionary dynamics. If x(S) is constant, 
Kimura's theory of neutral evolution predicts that the 
overlap in sequence space relaxes exponentially to the 
asymptotic value corresponding to random sequences, at 
a rate independent of the size of the biological popula- 
tion in which the evolution takes place. Moreover, the 
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number of substitutions in a protein sequence during an 
evolutionary trajectory lasting T generations is predicted 
to be a Poissonian variable of mean value \xxT . In this 
case, the fluctuations of the value of this variable in differ- 
ent species evolving for a time T should grow only as Vt. 
This result would sustain the molecular clock hypothesis, 
according to which the substitution process can be used 
as a clock to date speciation events. However, the molec- 
ular clock hypothesis has been heatedly debated in the 
last decade, and it has been shown that the number of 
substitutions fluctuates much more than predicted in the 
"homogeneous" neutral theory ||. This deviation from 
the prediction of Kimura's theory has been interpreted 
as an indication that in most cases protein evolution is 
not neutral but adaptive. We suggest that the irregu- 
larity of protein evolution could be an intrinsic property 
of the energy landscape of neutral networks of protein 
sequences. More stringent statistical tests should be de- 
signed to distinguish this situation from adaptive evolu- 
tion, that undoubtedly occurs in many cases. 

Selected sequences have a rather correlated energy 
landscape, which yields short folding times and high ther- 
modynamic stability. The native overlap that we use as a 
measure of thermodynamic stability correlates well with 
the Z score but it gives more information on the absence 
of states of low energy unrelated to the native state and 
favors a more correlated energy landscape. 

Sequences whose ground state coincides with the na- 
tive state may be discarded either due to the lack of 
thermodynamic stability or because they fold too slowly. 
Both classes of sequences have similar properties, since 
stability and folding time are related quantities. 

For about 30% of the attempted mutations the result- 
ing sequence has a ground state different than the original 
one. The overlap qo of this new ground state with the tar- 
get state has a bimodal distribution, but only structures 
very similar to the target one appear to fulfill our crite- 
rion of thermodynamic stability. This is not surprising, 
since the target structure must conserve a low energy in 
the mutated sequence, so that it is able to destabilize the 
new ground state. Therefore, it seems that contacts be- 
tween neutral networks of unrelated structures are very 
rare, if a stability condition is required. A similar con- 
clusion has been suggested in a numerical study of the 
two dimensional HP model [Q. 

An implication of this result is that it is difficult to 
switch from a structure to a different one through point 
mutations corresponding to thermodynamically stable 
proteins. This could explain why evolution changes so 
rarely the fold of a protein, while it is possible to engi- 
neer protein sequences with as much as 60% similarity 
with a natural protein and completely different fold. 

We studied a model which has the advantage of being 
reliably computable, but at the price of sacrificing possi- 
bly important ingredients. We discuss here the ones that 
we judge the most serious: 

1. We use a simple lattice model. This choice was 



made due to the necessity of identifying the na- 
tive state of each generated sequence, and this is 
feasible only for lattice models. Lattice models, al- 
though often criticized j37j, have been recognized 
to capture some of the most relevant thermody- 
namic features of the folding process [p8[ , such as 
the existence of a unique ground state and the co- 
operativity of the transition. However, they do not 
capture essential features of real proteins as for in- 
stance the existence of secondary structures. 

2. We simulated the evolution of only one target struc- 
ture. It would be interesting to see how our results 
change by changing the structure, and which prop- 
erties of the structure (for instance compactness, 
locality of interactions and so on) are important to 
determine the neutral mutation rate. However, it 
was argued that the small number of folds occurring 
in natural proteins (at most some thousands) could 
be the ones corresponding to the largest number of 
sequences in sequence space [^9| , so that structures 
characterized by a large neutral set, even if they 
are not typical, could be the most interesting ones 
from the biological point of view. 

3. The length of the sequences examined is short, so 
that there are only two core residues. Considering 
more core residues could impose more constraints 
on the evolution and reduce the rate of neutral evo- 
lution. It would thus be interesting to make the 
same study for longer sequences. 

4. We did not represent biological activity in the 
present protein model. This might be obtained by 
imposing more constraints on the residues taking 
part to the active site. 

5. In our model of evolution we assume that the en- 
vironment remains fairly constant, so that the na- 
tive structure favored by natural selection does not 
change throughout the evolution. This hypothesis 
is not unreasonable if the protein examined is an 
enzyme performing some chemical activity, since 
the cells possess a high homeostasis, i.e. they can 
maintain a stable chemical-physical internal envi- 
ronment despite large perturbations in the external 
environment. However, it is quite likely that some 
large ecological and climatic changes have been re- 
sponsible for molecular substitutions for which the 
neutral theory, and our model in particular, do not 
apply §. 

6. We consider only point mutations, and not inser- 
tions and deletions, which also play an important 
role in evolution. 

In our opinion, these limitations should not modify the 
qualitative picture. The existence of neutral networks, 
the variability of neutral mutation rates and the difficulty 
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to reach through point mutations very different struc- 
tures corresponding to stable proteins are features of our 
model that appear to be reflected also in the evolution of 
real proteins. 
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